The Intuit Data Journey
Accelerating Development of Smart, Personalized Financial Products & Services
This blog post is co-authored by Intuit Data Engineering Vice President, Mammad Zadeh, and Intuit Core Services & Experiences Senior Vice President, Raji Arasu.
Our mission at Intuit is to power prosperity around the world as an AI-driven expert platform company, by addressing the most pressing financial challenges facing our consumer, small business and self-employed customers. Our offerings, including TurboTax, QuickBooks and Mint, help them make more money, with the least amount of work, while having complete confidence in their actions and decisions.
Data capabilities are foundational to this platform, which collects, processes, and transforms a steady stream of raw data into a connected mesh of high-quality data. The resulting data accelerates development of personalized, smart products for our customers, while increasing productivity for internal data consumers.
Over the past three years, we have developed the foundations for a world-class data, machine learning and analytics platform that has led to a ten-fold increase in the amount of data being processed in the cloud and in the number of models deployed in our products. This is important, now more than ever, as recent global events and economic uncertainties have created unprecedented challenges for our customers. Intuit recently launched several new innovations to help consumers and small businesses with U.S. government aid and relief programs. The speed at which we delivered these new solutions would not have been possible without our data and machine learning (ML) platform, along with the AI that helps identify the users and businesses who could benefit from these programs.
A Tipping Point in our Journey
Back in late 2017, Intuit’s evolution from a collection of internally developed and acquired flagship products to a suite of connected financial services was in full swing, resulting in rapid growth of our customer base around the globe. At the same time, internal demand for access to high-quality, real-time data was growing exponentially, pushing our legacy data infrastructure and platform operability to their limits. Refreshing our infrastructure was expensive, and we had no compute elasticity to support data-heavy jobs such as real-time and batch processing, feature extraction, and ML model training. It used to take us several quarters to roll out a new ML model. We needed to hire specialized talent to maintain and operate proprietary infrastructure solutions. Our data was siloed, and we lacked reliable tools to assist with data discovery, lineage, and cleansing. A small data platform team could not sustainably support the ever-growing community of data workers — engineers, analysts and data scientists — at the speed at which the company needed to innovate.
We had reached a tipping point where our legacy data infrastructure was no longer scaling. We needed a new strategy and a new approach.
A Refreshed Data Strategy
We refreshed our data strategy to center around the following key themes:
All in the cloud. Our data and infrastructure needed to leverage cloud and cloud native technologies for scale, speed and elasticity. We needed to extend and integrate these technologies to other back-end systems to build a secure, reliable and self-service data and ML platform for our data consumers.
Data as a product. We needed to accelerate building great products, which meant spending less time onboarding to data systems or moving and transforming data. We needed to take a product-centric view of data and ML capabilities for both data producers and consumers, which meant building easy-to-use data products with best-in-class SLAs (quality, availability, performance, security, and cost-effectiveness). This required a deep understanding of the workflows of our data producers and consumers, for which we conducted many Follow Me Home sessions to understand the data products we needed to build.
Paved road for data. A plethora of technologies strewn across the company, many of which were not mature enough for the enterprise and certainly not operating as an integrated ecosystem, was slowing us down and resulting in wasted, overlapping effort. We declared convergence on a paved road as a priority. It was important to align the company on where we had to converge (fixed), where we would allow customizations (flexible), and where engineers could use any technology as they saw fit (free). The fixed paved road for data was carefully chosen, considering the benefits to our customers and the productivity of our internal data producers and consumers.
We needed a new infrastructure to cover the entire lifecycle of data and machine learning activities from production to consumption of data. These capabilities consisted of the following broad categories:
Data Infrastructure
- Transactional persistence infrastructure
- Data catalog to discover and track data and lineage
- Extensible data ingestion framework
- Data pipeline orchestration and management
- Real-time distributed stream processing
- Data lake and lake-house infrastructure
ML Infrastructure
- Machine learning training and hosting infrastructure
- Model lifecycle management tools and workflows
- Feature engineering and feature store capabilities
Analytics Infrastructure
- Curated, global data model
- Unified user experience and data portal
- Data exploration, curation and enrichment
Outcomes and Benefits
By late 2019, we had successfully migrated our entire data and machine learning infrastructure to the cloud and modernized our core data technologies. We have seen 50 percent fewer operational issues since cutting over to the new platform. We are seeing 20X more model deployments, and the platform has helped decrease model deployment time by 99 percent. For our data analysts, a huge delighter has been the improvement in data freshness from multiple days down to an hour. We have significantly reduced the heavy lifting while increasing confidence in how we perform FMEA (failure mode and effects analysis) tests, run load tests, and handle peak-season traffic. We now have a unified real-time instrumentation and clickstream tracking infrastructure, making it much easier for internal consumers to find what they need in one place across all Intuit offerings. Across the company, we are now seeing a big focus on real-time stream processing, with hundreds of stream processors for ML and analytics, and a shift in how real-time code is deployed.
Migration and Maturation
The cloud journey for both transactional and analytical data and ML systems took two years to complete, with every team at Intuit involved throughout the transition, ranging from engineers in the data teams to business product teams, data scientists, analysts, product managers and program managers. It also took a very engaged and highly responsive cloud partner (Amazon Web Services) to quickly turn around product requests on security, data and pipeline migration. The approach focused on: 1) migrating on-premise data to the cloud, 2) rewriting producers and transactional systems to start ingesting into the cloud, 3) rewriting data processing pipelines in the cloud, and 4) operating two systems in parallel, with constant parity checking and validations before switching over completely.
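To make step 4 concrete, here is a minimal, hypothetical sketch of the kind of parity check run between legacy and cloud pipelines: it fingerprints records from two extracts of the same table and summarizes counts and mismatches. The function names and record shapes are illustrative, not the actual implementation, and the example runs entirely in memory.

```python
import hashlib
from collections import Counter
from typing import Iterable, Mapping


def record_fingerprint(record: Mapping) -> str:
    """Stable fingerprint of a record, independent of key order."""
    canonical = "|".join(f"{k}={record[k]}" for k in sorted(record))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def parity_report(legacy_rows: Iterable[Mapping], cloud_rows: Iterable[Mapping]) -> dict:
    """Compare two extracts of the same table and summarize the differences."""
    legacy = Counter(record_fingerprint(r) for r in legacy_rows)
    cloud = Counter(record_fingerprint(r) for r in cloud_rows)
    return {
        "legacy_count": sum(legacy.values()),
        "cloud_count": sum(cloud.values()),
        "missing_in_cloud": sum((legacy - cloud).values()),
        "unexpected_in_cloud": sum((cloud - legacy).values()),
    }


# Example with two small in-memory extracts (hypothetical data).
legacy = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 12.5}]
cloud = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 12.5}, {"id": 3, "amount": 7.0}]
print(parity_report(legacy, cloud))
```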
Retrospective: What Worked Well
With any complex migration, there is a need to maintain two completely independent, parallel data ecosystems until the last data consumer moves over to the new system. This introduces a potentially significant double-bubble cost for the duration of the migration, stressing all the teams supporting it as well as the budget. We had the highest level of support throughout the company to fund the dual effort for a couple of years.
We obsessed about security, compliance and governance upfront in the implementation. Data observability, with visibility into any anomalies encountered in permissions, roles, policies, or movement of data was a critical part of the implementation. We partnered closely with our security team to continuously monitor our securability index and performed regular drills.
We designed an account structure that balances reducing blast radius with keeping data transfer costs manageable. Once we had the structure in place, we defined the perimeters of the data lake; the security, governance, and data access principles; and the ownership and cost of data processing between producing and consuming teams. Our approach was to devise a hub-and-spoke account structure, as explained in an earlier blog post on Provisioning the Intuit Data Lake in AWS.
Early on in our journey, we realized the need for a comprehensive data catalog capability. Our needs went beyond what traditional catalog solutions offered, requiring additional metadata and lineage capabilities around events, schemas, ML features, models, data products, and more. We adopted open source Apache Atlas and tailored it for the various types of data objects we need. This has enabled us to connect data from where it is produced, ingested, and transformed to the business outcomes it drives.
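As a rough illustration of how a catalog can be extended for new object types, the sketch below registers a hypothetical ML-feature entity type against Apache Atlas's v2 typedefs REST endpoint. The type name, attributes, endpoint, and credentials are placeholders, not Intuit's actual catalog model.

```python
import requests

ATLAS_URL = "http://atlas.example.com:21000"   # placeholder endpoint
AUTH = ("admin", "admin")                      # placeholder credentials

# Hypothetical entity type for ML features, extending Atlas's built-in DataSet
# type so instances can participate in standard lineage queries.
feature_typedef = {
    "entityDefs": [
        {
            "name": "ml_feature",               # hypothetical type name
            "superTypes": ["DataSet"],
            "attributeDefs": [
                {"name": "featureGroup", "typeName": "string",
                 "isOptional": False, "cardinality": "SINGLE"},
                {"name": "owner", "typeName": "string",
                 "isOptional": True, "cardinality": "SINGLE"},
            ],
        }
    ]
}

# Register the new type with Atlas's v2 typedefs endpoint.
resp = requests.post(
    f"{ATLAS_URL}/api/atlas/v2/types/typedefs",
    json=feature_typedef,
    auth=AUTH,
)
resp.raise_for_status()
print(resp.json())
```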
We converged on Amazon SageMaker as the foundation for our ML platform, which freed up our engineers to focus on additional security, monitoring, scalability, and automated model development and lifecycle management workflows for training, testing and deployment, using Intuit’s open source Argo Workflows. We were also able to extend the platform with a model monitoring service that detects drift and automates retraining of our models.
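For a sense of what one automated step looks like, here is a hedged sketch that launches a SageMaker training job through boto3; in a setup like the one described above, a step such as this would typically be wrapped in an Argo Workflows template rather than invoked ad hoc. The job name, container image, role, and S3 paths are placeholders.

```python
import boto3

sm = boto3.client("sagemaker", region_name="us-west-2")

# Kick off a managed training job (all identifiers below are hypothetical).
sm.create_training_job(
    TrainingJobName="churn-model-2020-06-01",
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    AlgorithmSpecification={
        "TrainingImage": "123456789012.dkr.ecr.us-west-2.amazonaws.com/churn:latest",
        "TrainingInputMode": "File",
    },
    InputDataConfig=[{
        "ChannelName": "train",
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://example-bucket/churn/train/",
            "S3DataDistributionType": "FullyReplicated",
        }},
    }],
    OutputDataConfig={"S3OutputPath": "s3://example-bucket/churn/models/"},
    ResourceConfig={"InstanceType": "ml.m5.xlarge",
                    "InstanceCount": 1,
                    "VolumeSizeInGB": 50},
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
)
```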
Early on, we worked on defining a new company-wide clickstream tracking standard and drove its adoption across the company. This was critical in getting us to a unified instrumentation system, improving data freshness and quality and giving the business better visibility across our portfolio of products.
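The sketch below shows one way a shared tracking standard can be enforced: validating every event against a common envelope schema before it is accepted. The schema and event fields here are hypothetical, not the actual Intuit standard.

```python
from jsonschema import ValidationError, validate

# Hypothetical slice of a company-wide clickstream standard: every event must
# carry the same envelope regardless of which product emitted it.
CLICKSTREAM_SCHEMA = {
    "type": "object",
    "required": ["event_name", "timestamp", "product", "session_id"],
    "properties": {
        "event_name": {"type": "string"},
        "timestamp": {"type": "string", "format": "date-time"},
        "product": {"type": "string", "enum": ["turbotax", "quickbooks", "mint"]},
        "session_id": {"type": "string"},
        "properties": {"type": "object"},
    },
    "additionalProperties": False,
}

event = {
    "event_name": "invoice_created",
    "timestamp": "2020-06-01T12:00:00Z",
    "product": "quickbooks",
    "session_id": "abc-123",
    "properties": {"invoice_total": 250.00},
}

try:
    validate(instance=event, schema=CLICKSTREAM_SCHEMA)
    print("event conforms to the tracking standard")
except ValidationError as err:
    print(f"rejected: {err.message}")
```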
It was easy for teams to spin up overlapping technologies, given the many choices available through cloud and cloud native solutions. However, integrating them with other systems in the company and adding the necessary features for privacy, security and compliance was hard work. So, by declaring the paved road for data and the notion of fixed, flexible, and free, we were able to pool our efforts and move with speed toward a common outcome.
Retrospective: Lessons Learned
After a few false starts, our data cloud migration got on track once we pulled together the data producers, data consumers and data platform teams into one mission-based team with a real-time dashboard tracking impediments, progress, and data parity across on-premise and cloud pipelines. Once in place, these data-backed dashboards increased focus and alignment on common outcomes.
A lift-and-shift strategy that worked well for some of our transactional systems did not work so well for data and analytics systems. Cloud data solutions did not map one-to-one to our existing on-premise solutions, and required a complete rewrite of many capabilities such as ingestion, processing, data classification, account management, and the machine learning platform. Not knowing this from the get-go slowed us down by several months.
Instrumenting costs from day one would have given us visibility into our spend and helped uncover hidden costs, such as excessive logging or data transfer charges. Our teams needed to learn and pivot to a new way of managing hosting costs in the cloud, on a daily basis, and be accountable for them. Cost optimization is now a normal part of our engineering work, and engineering managers are responsible for monitoring their budgets. Meticulous tagging of every provisioned component helps us quickly react to spend anomalies.
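As a small illustration of that tagging discipline, the following sketch applies a standard set of cost-attribution tags to an EC2 instance with boto3. The tag keys, values, and instance id are hypothetical; in practice tagging is usually enforced through infrastructure-as-code rather than ad hoc scripts.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

# Hypothetical tagging convention: every provisioned component carries owner,
# cost-center, and pipeline tags so spend can be sliced and anomalies traced.
STANDARD_TAGS = [
    {"Key": "owner", "Value": "data-platform"},
    {"Key": "cost-center", "Value": "analytics-1234"},
    {"Key": "pipeline", "Value": "clickstream-ingest"},
]

# Apply the tags to a (placeholder) instance backing a pipeline.
ec2.create_tags(
    Resources=["i-0123456789abcdef0"],
    Tags=STANDARD_TAGS,
)
```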
Migrating the long tail of data ingestion and processing pipelines proved a formidable challenge. Identifying the owners and retiring many jobs that were no longer relevant felt like a dinosaur excavation project. At one point, we decided to shut down our on-premise data systems to test what would break. In hindsight, we should have done this sooner rather than later.
Our Journey Continues
Almost every aspect of our legacy data ecosystem has now undergone some form of modernization, from our legacy transactional persistence stores to our legacy Hadoop grid, ingestion and processing technologies, clickstream tracking, ML capabilities, data marts, data catalog, real-time stream processing, feature engineering, and curation technologies. But the work continues.
Here are some of the areas we’re focusing on, going forward:
Data Mesh. The next chapter in our data journey is cultivating a mindset where our product teams treat data produced and consumed as an essential feature of their product. When data is treated like code, it creates accountability for quality, freshness, and availability of data, making it easier and speedier for data consumers to derive critical insights. To enable this shift, we have to provide an easy-to-use, self-serve data platform abstracting complexities, thereby allowing engineers and data scientists to build and manage domain specific data artifacts on their own. The platform also needs to provide semantic harmonization and governing standards for data and technologies. This cultural shift, with the right platform capabilities, will help democratize data-driven decisions as we scale up product teams and product experimentations throughout the company. In essence, this is our strategy as referenced in the seminal article on Data Mesh.
Permissions management. We do a lot centrally to monitor, observe, and alert on data access anomalies. However, in a decentralized model where the ownership of data is distributed amongst the producers of data, it’s best to let the owners take control of granting or revoking access to their data. They know their data best and, therefore, can make the right decisions on who should or shouldn’t have access. Today, we have the basic tools but we are actively working towards a more self-serve, fine-grained access control architecture.
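A minimal sketch of the owner-administered model described above: the owning team registers its dataset and is the only party allowed to grant or revoke read access. This is an in-memory illustration with made-up names, not our actual access-control service.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class DatasetAccess:
    """Access list for one dataset, administered by the owning team."""
    owner_team: str
    readers: Set[str] = field(default_factory=set)


class AccessRegistry:
    """Owner-administered, dataset-level grants (toy illustration)."""

    def __init__(self) -> None:
        self._datasets: Dict[str, DatasetAccess] = {}

    def register(self, dataset: str, owner_team: str) -> None:
        self._datasets[dataset] = DatasetAccess(owner_team=owner_team)

    def grant(self, dataset: str, requesting_team: str, grantee: str) -> None:
        acl = self._datasets[dataset]
        if requesting_team != acl.owner_team:
            raise PermissionError("only the owning team can grant access")
        acl.readers.add(grantee)

    def revoke(self, dataset: str, requesting_team: str, grantee: str) -> None:
        acl = self._datasets[dataset]
        if requesting_team != acl.owner_team:
            raise PermissionError("only the owning team can revoke access")
        acl.readers.discard(grantee)

    def can_read(self, dataset: str, principal: str) -> bool:
        return principal in self._datasets[dataset].readers


registry = AccessRegistry()
registry.register("payments.transactions", owner_team="payments")
registry.grant("payments.transactions", requesting_team="payments", grantee="risk-ml")
print(registry.can_read("payments.transactions", "risk-ml"))  # True
```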
Data processing orchestration. We have tens of thousands of data pipelines that bring data into the lake and transform it to be consumable for analytics and ML use cases. Most of these batch pipelines currently run on AWS EMR Hive or Spark, but we also use the Databricks Spark runtime and run open source Apache Spark on our own Kubernetes infrastructure. We are now building an orchestration abstraction layer to streamline the deployment and operation of these runtimes for the domain teams.
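To illustrate the idea of such an abstraction layer, the sketch below defines a runtime-agnostic job description and a common submission interface that EMR, Databricks, and Spark-on-Kubernetes backends could each implement; only a dry-run backend is shown, and all names are hypothetical.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SparkJob:
    """Runtime-agnostic description of a batch transformation."""
    name: str
    entry_point: str                      # e.g. an S3 path to a PySpark script
    args: List[str] = field(default_factory=list)
    conf: Dict[str, str] = field(default_factory=dict)


class SparkRuntime(ABC):
    """Interface the orchestration layer targets; EMR, Databricks, and
    Spark-on-Kubernetes backends would each provide an implementation."""

    @abstractmethod
    def submit(self, job: SparkJob) -> str:
        """Submit a job and return a runtime-specific run id."""


class LocalDryRunRuntime(SparkRuntime):
    """Stand-in backend that just echoes what would be submitted."""

    def submit(self, job: SparkJob) -> str:
        print(f"[dry-run] {job.name}: {job.entry_point} {' '.join(job.args)}")
        return f"dry-run-{job.name}"


job = SparkJob(
    name="daily-clickstream-rollup",
    entry_point="s3://example-bucket/jobs/rollup.py",
    args=["--date", "2020-06-01"],
    conf={"spark.executor.memory": "4g"},
)
print(LocalDryRunRuntime().submit(job))
```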
Feature engineering. Most recently, we’ve added feature engineering as an integral part of our ML platform. It consists of a feature creation step, often done as stream processing in real time, and a feature repository for accessing features at runtime. This is an active area of research and development for us, and we are a leading voice in the feature engineering space.
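A toy sketch of those two halves: a stream processor writes the latest value of a feature keyed by entity, and model-serving code reads it back at request time. The feature store here is an in-memory stand-in with made-up feature names, not our production repository.

```python
import time
from typing import Any, Dict, Optional, Tuple


class InMemoryFeatureStore:
    """Toy online feature store: stream processors write the latest value per
    (entity, feature); model-serving code reads it back at request time."""

    def __init__(self) -> None:
        self._values: Dict[Tuple[str, str], Tuple[Any, float]] = {}

    def put(self, entity_id: str, feature: str, value: Any) -> None:
        self._values[(entity_id, feature)] = (value, time.time())

    def get(self, entity_id: str, feature: str) -> Optional[Any]:
        entry = self._values.get((entity_id, feature))
        return entry[0] if entry else None


store = InMemoryFeatureStore()
# A stream processor might maintain a rolling aggregate like this one.
store.put(entity_id="user-42", feature="txn_count_7d", value=17)
# At inference time the model host reads the same feature by key.
print(store.get("user-42", "txn_count_7d"))  # 17
```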
Paradigm shift in real-time processing. The foundation of our real-time data capabilities, Event Bus, is based on open source Apache Kafka and provides our distributed eventing infrastructure. It is used widely for transactional, analytical, and machine learning purposes across the company. We are witnessing a paradigm shift in how new real-time applications, services, and features are deployed on this infrastructure and, to support it, we have extended our stream processing platform with Apache Beam, supporting Apache Flink and Apache Samza as runtime engines. We expect to see exponential growth in this space.
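As a rough sketch of the portable model this enables, the Beam pipeline below reads from a Kafka topic and counts events per minute, targeting the Flink runner; the same pipeline code could be pointed at another portable runner such as Samza. The topic name and endpoints are placeholders, and running it also requires Beam's Java expansion service for the Kafka connector.

```python
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

# Portable Beam pipeline; changing the runner changes where it executes,
# not the pipeline code itself.
options = PipelineOptions(
    runner="FlinkRunner",
    flink_master="localhost:8081",        # placeholder Flink REST endpoint
    streaming=True,
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadEvents" >> ReadFromKafka(
            consumer_config={"bootstrap.servers": "kafka.example.com:9092"},
            topics=["clickstream-events"],   # hypothetical topic name
        )
        # Kafka records arrive as (key, value) byte pairs.
        | "DecodeValue" >> beam.Map(lambda kv: kv[1].decode("utf-8"))
        | "WindowPerMinute" >> beam.WindowInto(beam.window.FixedWindows(60))
        | "PairWithOne" >> beam.Map(lambda _: ("events", 1))
        | "CountPerWindow" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```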
Data Curation. Our curation platform makes it easy to author custom domain processing logic and produce clean and connected entities into our data graph. This ensures our data consumers can have access to the highest quality data available. At the same time, it’s a critical component in empowering data producers to take ownership of producing consumable data. Our vision is to curate all data produced at Intuit through this platform.
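To give a flavor of what custom domain processing logic might look like, here is a hypothetical curation step that normalizes and deduplicates raw business records into clean entities; the entity shape and rules are illustrative only.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class BusinessEntity:
    """A cleaned, connected entity as it might land in the data graph."""
    entity_id: str
    name: str
    country: str


def curate_businesses(raw_records: Iterable[Dict]) -> List[BusinessEntity]:
    """Apply domain rules (normalization + dedup) to raw producer records."""
    by_id: Dict[str, BusinessEntity] = {}
    for rec in raw_records:
        entity = BusinessEntity(
            entity_id=rec["id"].strip(),
            name=rec["name"].strip().title(),
            country=rec.get("country", "US").upper(),
        )
        by_id[entity.entity_id] = entity        # last record wins
    return list(by_id.values())


raw = [
    {"id": " b-1 ", "name": "acme plumbing", "country": "us"},
    {"id": "b-1", "name": "ACME Plumbing "},
    {"id": "b-2", "name": "blue sky bakery", "country": "ca"},
]
for entity in curate_businesses(raw):
    print(entity)
```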
Operability and cost management. As data systems grow more complex, requiring data quality and freshness to be guaranteed in real time, and as data and compute grow at an exponential rate, operability becomes ever more critical. For example, we need near real-time SLAs (service-level agreements) for processing large volumes of clickstream data and ML feature extraction for use cases like fraud detection, at a manageable cost. We are constantly improving observability and cost optimization through waste-monitoring sensors and by evaluating alternative technologies.
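As a simple illustration of a waste sensor, the sketch below flags days whose spend deviates sharply from a trailing baseline. Real detection would be richer, but the shape of the check is similar; the numbers are made up.

```python
from statistics import mean, stdev
from typing import List


def spend_anomalies(daily_spend: List[float], threshold_sigma: float = 2.0) -> List[int]:
    """Flag days whose spend deviates from the trailing 7-day baseline by more
    than `threshold_sigma` standard deviations; return the offending indexes."""
    flagged = []
    for day in range(7, len(daily_spend)):
        window = daily_spend[day - 7:day]
        baseline, spread = mean(window), stdev(window)
        if spread and abs(daily_spend[day] - baseline) > threshold_sigma * spread:
            flagged.append(day)
    return flagged


# Mostly steady spend with one spike (e.g. runaway logging on day 10).
spend = [100, 102, 98, 101, 99, 103, 100, 101, 99, 100, 180, 102, 100]
print(spend_anomalies(spend))  # [10]
```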
We are continuously innovating in this space, so stay tuned for more!